Documentation for read trimming, mapping, variant calling, and SNP filtering can be accessed here. Four outlier detection programs were used to identify outlier SNPs (putatively under selection). pcadapt, OutFLANK, and BayeScan identify outliers based on population structure. LFMM2 identifies outliers based on associations with environmental predictors.
In terminal:
In the NB_ddhaplo_working directory, make a new directory for outlier detection and copy over the VCF and popmap files
$ mkdir NB_OutlierDetection_working
$ cd NB_OutlierDetection_working
$ cp ../NB_SNPFiltering_working/SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf.gz .
$ cp ../NB_SNPFiltering_working/popmap .
Unzipping gzipped vcf file
$ gzip -d SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf.gz
In R:
# install and load pcadapt
#install.packages("pcadapt")
library(pcadapt)
# Convert VCF file to pcadapt format
vcf2pcadapt("SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf", output = "tmp.pcadapt", allele.sep = c("/", "|"))
filename <- read.pcadapt("tmp.pcadapt", type = "pcadapt")
# Choosing the number K of Principal Components
# We start off with a large number of PCs
x6 <- pcadapt(input = filename, K = 20)
# Plot the proportion of explained variance for each PC (scree plot)
plot(x6, option = "screeplot")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the pcadapt package.
## Please report the issue at <https://github.com/bcm-uga/pcadapt/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Plot the scree plot again, showing only the first 10 Principal Components
plot(x6, option = "screeplot", K = 10)
# Create population designations minus NAR, NAR2, GHP
poplist.names6 <- c(rep("BAR", 10),rep("BIS", 10),rep("GB", 10), rep("KIC", 10),rep("MCD", 10), rep("PVD", 10))
# Score Plot
# Plot the first two PCs
plot(x6, option = "scores", pop = poplist.names6)
## Warning: Use of `df$Pop` is discouraged.
## ℹ Use `Pop` instead.
# PCA plot with PCs 3 and 4
plot(x6, option = "scores", i = 3, j = 4, pop = poplist.names6)
# Evaluating LD in the dataset - looking to see if loadings are clustered in certain genomic regions (following https://bcm-uga.github.io/pcadapt/articles/pcadapt.html#f--linkage-disequilibrium-ld-thinning)
par(mfrow = c(2, 2))
for (i in 1:4)
plot(x6$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
The top-left panel shows that PC1 is driven by one (maybe two) genomic regions, likely still an inversion. To minimize outlier detection driven by LD, I can apply LD thinning. I'm going to use pcadapt's default window size and r² threshold (size = 500, thr = 0.1).
# performing SNP thinning in order to compute PCs
res6 <- pcadapt(filename, K = 10, LD.clumping = list(size = 500, thr = 0.1))
plot(res6, option = "screeplot")
Based on this scree plot, it's not clear where the elbow is (where the proportion of explained variance levels off). I'm going to try a few different values of K moving forward. After some exploratory testing, I am moving forward with K = 4.
# Score Plot
# Plot the first two PCs
plot(res6, option = "scores", pop = poplist.names6)
# PCA plot with PCs 3 and 4
plot(res6, option = "scores", i = 3, j = 4, pop = poplist.names6)
res6 <- pcadapt(filename, K = 4, LD.clumping = list(size = 500, thr = 0.1))
par(mfrow = c(1,1))
plot(res6$loadings[, 1], pch = 19, cex = .3, ylab = "Loadings PC1")
No evidence of the inversions shows up in the PC1 loadings.
plot(res6)
We still see some evidence of the inversions in the Manhattan plot; however, this may be due to stronger selective pressures in the inverted regions.
# Evaluating LD in the dataset - looking to see if loadings are still clustered in certain genomic regions (following https://bcm-uga.github.io/pcadapt/articles/pcadapt.html#f--linkage-disequilibrium-ld-thinning)
par(mfrow = c(2, 2))
for (i in 1:4)
plot(res6$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
res6$singular.values
## [1] 0.05504607 0.05443791 0.05423507 0.05373564
# Create Q-Q Plot
plot(res6, option = "qqplot", threshold = 0.1)
hist(res6$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
# Look at the test statistic distribution
plot(res6, option = "stat.distribution")
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## ℹ The deprecated feature was likely used in the pcadapt package.
## Please report the issue at <https://github.com/bcm-uga/pcadapt/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_bin()`).
library(qvalue) # transforms p-values into q-values.
qval6 <- qvalue(res6$pvalues)$qvalues
alpha6 <- 0.1 # expected false discovery rate of 10% - raised from 0.05
outliers6 <- which(qval6 < alpha6)
outliers6
## [1] 154 171 172 173 174 751 947 1652 1662 2865 5329 6505
## [13] 8127 8206 8334 8474 8477 8482 10559 11546 13494 15497 16911 16963
## [25] 16967 16968 16987 16988 16990 16991 17042 17142 17439 17445 17616 17629
## [37] 17780 17784 17785 17789 17808 17809 18650 18663 18679 18709 18761 18808
## [49] 18812 18887 18890 18924 19011 19091 19094 19125 19130 19134
With K = 4 and LD clumping (window size = 500, squared correlation threshold = 0.1), 58 outlier SNPs were identified.
# Save outlier indices to a .txt file, one per line (overwrites any previous file,
# so re-running doesn't append duplicates)
writeLines(as.character(outliers6), "outliers_pcadapt_thinned500.txt")
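The saved values are row indices into the VCF, not locus names. If locus names are more convenient downstream, a sketch like the following converts them using the same vcfR calls applied later in this workflow; `vcf_tmp` and the output filename are hypothetical, not part of the original analysis.

```r
# Hypothetical sketch: map pcadapt outlier indices (row numbers in the VCF)
# back to chromosome_position locus names via vcfR
library(vcfR)
vcf_tmp <- read.vcfR("SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf")
loci <- paste0(getCHROM(vcf_tmp), "_", getPOS(vcf_tmp))
writeLines(loci[outliers6], "outliers_pcadapt_thinned500_names.txt")
```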
# Load all necessary packages
library(OutFLANK) # outflank package
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.12.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(bigsnpr) # package for Linkage Disequilibrium pruning
## Loading required package: bigstatsr
# Convert VCF to OutFLANK format
my_vcf6 <- read.vcfR("SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 78
## header_line: 79
## variant count: 22874
## column count: 69
##
## Meta line 78 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 22874
## Character matrix gt cols: 69
## skip: 0
## nrows: 22874
## row_num: 0
##
## Processed variant: 22874
## All variants processed
geno6 <- extract.gt(my_vcf6) # Character matrix containing the genotypes
position6 <- getPOS(my_vcf6) # Positions in bp
chromosome6 <- getCHROM(my_vcf6) # Chromosome information
G6 <- matrix(NA, nrow = nrow(geno6), ncol = ncol(geno6))
G6[geno6 %in% c("0/0", "0|0")] <- 0
G6[geno6 %in% c("0/1", "1/0", "1|0", "0|1")] <- 1
G6[geno6 %in% c("1/1", "1|1")] <- 2
# NA should be replaced with "9" to work with the functions in the OutFLANK package
G6[is.na(G6)] <- 9
head(G6[,1:10])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 1 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 2 0 0 0 0 0 0 0
## [5,] 0 0 2 0 0 0 0 0 0 0
## [6,] 0 0 0 0 1 0 1 0 0 0
# Convert pop designations to a vector
pop6 <- as.vector(poplist.names6)
pop6
## [1] "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BAR" "BIS" "BIS"
## [13] "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "BIS" "GB" "GB" "GB" "GB"
## [25] "GB" "GB" "GB" "GB" "GB" "GB" "KIC" "KIC" "KIC" "KIC" "KIC" "KIC"
## [37] "KIC" "KIC" "KIC" "KIC" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD" "MCD"
## [49] "MCD" "MCD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD" "PVD"
# Calculate Fst
my_fst6 <- MakeDiploidFSTMat(t(G6), locusNames = paste0(chromosome6,"_", position6), popNames = pop6)
## Calculating FSTs, may take a few minutes...
## [1] "10000 done of 22874"
## [1] "20000 done of 22874"
head(my_fst6)
## LocusName He FST T1 T2 FSTNoCorr
## 1 NC_035789.1_147427 0.1116059 0.05531086 0.003146064 0.05687967 0.113516624
## 2 NC_035789.1_147498 0.1098611 -0.04377104 -0.002407407 0.05500000 0.007575758
## 3 NC_035789.1_147507 0.1244444 -0.03980986 -0.002481481 0.06233333 0.010695187
## 4 NC_035789.1_147556 0.1551278 -0.05293398 -0.004112412 0.07768946 0.009789265
## 5 NC_035789.1_147576 0.1409078 -0.03605094 -0.002551254 0.07076804 0.026704323
## 6 NC_035789.1_147582 0.1116059 -0.04495529 -0.002511675 0.05587052 0.007337601
## T1NoCorr T2NoCorr meanAlleleFreq
## 1 0.0064568966 0.05688063 0.9406780
## 2 0.0004166667 0.05500000 0.9416667
## 3 0.0006666667 0.06233333 0.9333333
## 4 0.0007605364 0.07769086 0.9152542
## 5 0.0018898467 0.07076932 0.9237288
## 6 0.0004099617 0.05587136 0.9406780
# the output will be a table showing Fst statistics for each locus
# Plot heterozygosity vs Fst
plot(my_fst6$He, my_fst6$FST)
# Plot Fst vs FstNoCorr
plot(my_fst6$FST, my_fst6$FSTNoCorr)
abline(0,1)
# Load additional packages
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("stringr")
# Chromosome identifiers need to be of class integer for snp_autoSVD to work.
# Visualizing chromosomes
chrom_unique6 <- unique(chromosome6)
print(chrom_unique6)
## [1] "NC_035789.1" "NC_035780.1" "NC_035781.1" "NC_035782.1" "NC_035783.1"
## [6] "NC_035784.1" "NC_035785.1" "NC_035786.1" "NC_035787.1" "NC_035788.1"
# Removing "NC_" from the chromosome name so it can be converted to an integer
chrom_new6 <- chromosome6%>%str_replace("NC_","")
# Converting the character string into an integer
chrom6 <- as.integer(chrom_new6)
# Chromosome and position need to be sorted for snp_autoSVD to work.
chrom_sort6 <- sort(chrom6)
pos_sort6 <- sort(position6)
Using snp_fastImputeSimple to impute missing genotypes
# Converting the genotype matrix to class FBM.code256
G1_6 <- add_code256(big_copy(t(G6),type="raw"),code=bigsnpr:::CODE_012)
# Imputing missing genotypes
G_missSimple6 <- snp_fastImputeSimple(G1_6, method = "mode")
The imputation method can be "random" (sampling according to allele frequencies), "mean0" (rounded mean), "mean2" (mean rounded to 2 decimal places), or "mode" (most frequent call). "Random" makes the most sense to me because imputation would be based on allele frequencies (more likely to fill in potential outliers) rather than a generalized mean or the most frequent call. I tested all four methods and 0 outliers were identified with each, so the choice doesn't seem to matter for this data set.
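The four-method comparison above can be looped over in one pass. This is a sketch, not the original code: it assumes the G6 matrix and the add_code256() construction from above, and the `methods`/`imputed` names are mine; the downstream outlier comparison is left to the OutFLANK steps that follow.

```r
# Sketch (assumes G6 from above): impute with each of the four methods
# and count how many genotype calls differ from the "mode" imputation
methods <- c("random", "mean0", "mean2", "mode")
imputed <- lapply(methods, function(m) {
  # rebuild the FBM.code256 each time so every method starts from raw data
  Gm <- add_code256(big_copy(t(G6), type = "raw"), code = bigsnpr:::CODE_012)
  snp_fastImputeSimple(Gm, method = m)[]
})
names(imputed) <- methods
sapply(imputed, function(x) sum(x != imputed$mode, na.rm = TRUE))
```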
newpc6 <-snp_autoSVD(G_missSimple6,infos.chr = chrom_sort6,infos.pos = pos_sort6)
##
## Phase of clumping (on MAF) at r^2 > 0.2.. keep 12015 SNPs.
## Discarding 3137 variants with MAC < 10.
##
## Iteration 1:
## Computing SVD..
## 37 outlier variants detected..
## 1 long-range LD region detected..
##
## Iteration 2:
## Computing SVD..
## 54 outlier variants detected..
## 2 long-range LD regions detected..
##
## Iteration 3:
## Computing SVD..
## 0 outlier variant detected..
##
## Converged!
which_pruned6 <- attr(newpc6, which="subset") # Indexes of remaining SNPS after pruning
length(which_pruned6)
## [1] 8787
head(which_pruned6)
## [1] 4 7 10 11 12 19
out_trim6 <- OutFLANK(my_fst6[which_pruned6,], NumberOfSamples=60, qthreshold = 0.05, Hmin = 0.1)
str(out_trim6)
## List of 6
## $ FSTbar : num 0.000955
## $ FSTNoCorrbar : num 0.0486
## $ dfInferred : num 5.33
## $ numberLowFstOutliers : int 0
## $ numberHighFstOutliers: int 0
## $ results :'data.frame': 8787 obs. of 15 variables:
## ..$ LocusName : Factor w/ 22874 levels "NC_035780.1_10067423",..: 22591 22625 22628 22640 22641 22744 22757 22758 22783 22786 ...
## ..$ He : num [1:8787] 0.155 0.428 0.274 0.479 0.155 ...
## ..$ FST : num [1:8787] -0.05293 -0.00637 0.01919 -0.01497 0.01537 ...
## ..$ T1 : num [1:8787] -0.00411 -0.00137 0.00266 -0.0036 0.00121 ...
## ..$ T2 : num [1:8787] 0.0777 0.2156 0.1385 0.2405 0.0785 ...
## ..$ FSTNoCorr : num [1:8787] 0.00979 0.04399 0.06594 0.03073 0.0716 ...
## ..$ T1NoCorr : num [1:8787] 0.000761 0.009485 0.009132 0.007392 0.005621 ...
## ..$ T2NoCorr : num [1:8787] 0.0777 0.2156 0.1385 0.2405 0.0785 ...
## ..$ meanAlleleFreq : num [1:8787] 0.915 0.69 0.164 0.603 0.915 ...
## ..$ indexOrder : int [1:8787] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ GoodH : Factor w/ 1 level "goodH": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ qvalues : num [1:8787] 0.97 0.934 0.914 0.959 0.914 ...
## ..$ pvalues : num [1:8787] 0.0648 0.9616 0.4676 0.6297 0.3809 ...
## ..$ pvaluesRightTail: num [1:8787] 0.968 0.481 0.234 0.685 0.19 ...
## ..$ OutlierFlag : logi [1:8787] FALSE FALSE FALSE FALSE FALSE FALSE ...
# Check the fit and make sure it looks good, especially in the right tail:
OutFLANKResultsPlotter(out_trim6, withOutliers = TRUE, NoCorr = TRUE, Hmin = 0.1,binwidth = 0.001, Zoom = FALSE, RightZoomFraction = 0.05, titletext = NULL)
# Check the p-value histogram
hist(out_trim6$results$pvaluesRightTail)
# Using estimated neutral mean FST and df to calculate P-values for all loci
P1_6 <- pOutlierFinderChiSqNoCorr(my_fst6, Fstbar = out_trim6$FSTNoCorrbar, dfInferred = out_trim6$dfInferred, qthreshold = 0.05, Hmin=0.1)
# Identify outlier SNPs
my_out6 <- P1_6$OutlierFlag==TRUE
plot(P1_6$He, P1_6$FST, pch=19, col=rgb(0,0,0,0.1))
points(P1_6$He[my_out6], P1_6$FST[my_out6], col="blue")
P1_6[which(P1_6$OutlierFlag == TRUE),]
## [1] LocusName He FST T1
## [5] T2 FSTNoCorr T1NoCorr T2NoCorr
## [9] meanAlleleFreq pvalues pvaluesRightTail qvalues
## [13] OutlierFlag
## <0 rows> (or 0-length row.names)
outliers_out6 <- P1_6[which(P1_6$OutlierFlag == TRUE),]
outliers_list6 <- outliers_out6$LocusName
outliers_list6
## factor(0)
## 22874 Levels: NC_035780.1_10067423 NC_035780.1_1007505 ... NC_035789.1_9490358
No outliers detected
In terminal:
Convert the VCF file to BayeScan format
# download the PGDSpider configuration file used to convert the VCF to BayeScan format
$ curl -L -O https://raw.githubusercontent.com/amyzyck/EecSeq_NB_EasternOyster/master/Scripts/BSsnp.spid
$ chmod +x BSsnp.spid
$ java -jar /usr/local/bin/PGDSpider2-cli.jar -spid BSsnp.spid -inputfile SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf -outputfile SNP6p9g9.BS
output:
WARN 12:07:05 - PGDSpider configuration file not found! Loading default configuration.
initialize convert process...
read input file...
read input file done.
write output file...
write output file done.
Run BayeScan
$ BayeScan2.1_linux64bits SNP6p9g9.BS -thin 50 -nbp 30
In R:
source("BayeScan/plot_R.r")
plot_bayescan("BayeScan/SNP6p9g9_fst.txt")
## $outliers
## integer(0)
##
## $nb_outliers
## [1] 0
# Identifying outliers using a False discovery rate of 10%.
bs6p9g9 <- plot_bayescan("BayeScan/SNP6p9g9_fst.txt", FDR = 0.10)
bs6p9g9$outliers
## integer(0)
No outliers detected
Following steps documented by pgugger here to prep the genomic and environmental data.
#if (!require("BiocManager", quietly = TRUE))
#install.packages("BiocManager")
#BiocManager::install("LEA")
library(LEA)
For LFMM, the input genomic data needs to be SNP genotypes encoded as 0, 1, or 2 for homozygous reference, heterozygous, and homozygous alternate, respectively. Missing data is coded as "9".
In terminal:
# Converting vcf to genotyped file for input
$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --012 --out snp6p9g9
# Replacing missing data with 9 and creating the .lfmm file
$ sed 's/-1/9/g' snp6p9g9.012 | cut -f2- > snp6p9g9.lfmm
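As a quick sanity check of this conversion, here is a toy example (hypothetical file names, not the real data) showing that the -1 missing codes become 9 and the leading individual-index column from vcftools --012 output is dropped:

```shell
# Toy stand-in for a vcftools .012 file: first column is the individual
# index; genotype columns follow, with -1 marking missing calls
printf '0\t0\t1\t-1\t2\n1\t2\t-1\t0\t1\n' > toy.012
# Recode missing (-1 -> 9) and drop the index column, as above
sed 's/-1/9/g' toy.012 | cut -f2- > toy.lfmm
cat toy.lfmm
```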
I also have a VCF file with SNPs inside known inversions removed. I want to run LFMM2 on this dataset too.
$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A_noinvert.recode.vcf --012 --out snp6p9g9_noinvert
$ sed 's/-1/9/g' snp6p9g9_noinvert.012 | cut -f2- > snp6p9g9_noinvert.lfmm
# run snmf to estimate K, considering K from 1-10:
project_6p9g9 = NULL
project_6p9g9 = snmf("LFMM2/snp6p9g9.lfmm", K = 1:10, entropy = TRUE, repetitions = 10, project = "new")
#pdf("sNMF_6p9g9.pdf")
plot(project_6p9g9, col = "blue", pch = 19, cex = 1.2)
#dev.off()
Based on the plot of cross-entropy, K = 1 has the lowest cross-entropy value.
# Generating a Structure-type plot. K = 1 is not a very interesting plot to look at, so I also generated plots with K = 2-6. With K = 6, the individuals are spread across the 6 ancestry populations.
best_6p9g9 = which.min(cross.entropy(project_6p9g9, K = 6))
# pdf("sNMF.barchart_6p9g9.pdf")
barchart(project_6p9g9, K = 6, run = best_6p9g9, border = NA, space = 0, col = c("red", "blue","green","yellow","orange","pink"), xlab = "Individuals", ylab = "Ancestry proportions") -> bp_6p9g9
axis(1, at = 1:length(bp_6p9g9$order), labels = bp_6p9g9$order, las=1, cex.axis = .3)
#dev.off()
#imputing any missing data
project.missing_6p9g9 = snmf("LFMM2/snp6p9g9.lfmm", K = 1,
entropy = TRUE, repetitions = 10,
project = "new")
# select the run with the lowest cross-entropy value
best_6p9g9 = which.min(cross.entropy(project.missing_6p9g9, K = 1))
# Impute the missing genotypes
impute(project.missing_6p9g9, "LFMM2/snp6p9g9.lfmm",
method = 'random', K = 1, run = best_6p9g9)
## Missing genotype imputation for K = 1
## Missing genotype imputation for run = 9
## Results are written in the file: LFMM2/snp6p9g9.lfmm_imputed.lfmm
## Missing genotype imputation for K = 1
## Missing genotype imputation for run = 6
## Results are written in the file: snp_.lfmm_imputed.lfmm
# Proportion of correct imputation results
dat.imp_6p9g9 = read.lfmm("LFMM2/snp6p9g9.lfmm_imputed.lfmm")
#mean( tutorial.R[dat == 9] == dat.imp[dat == 9] )
Repeating with the no-invert dataset
# run snmf to estimate K, considering K from 1-10:
project_noinvert_6p9g9 = NULL
project_noinvert_6p9g9 = snmf("LFMM2/snp6p9g9_noinvert.lfmm", K = 1:10, entropy = TRUE, repetitions = 10, project = "new")
#pdf("sNMF_noinvert_6p9g9.pdf")
plot(project_noinvert_6p9g9, col = "blue", pch = 19, cex = 1.2)
#dev.off()
Based on the plot of cross-entropy, K = 1 has the lowest cross-entropy value.
# Generating a Structure-type plot. K = 1 is not a very interesting plot to look at, so I also generated plots with K = 2-6. With K = 6, the individuals are spread across the 6 ancestry populations.
best_noinvert_6p9g9 = which.min(cross.entropy(project_noinvert_6p9g9, K = 6))
# pdf("sNMF.barchart_noinvert_6p9g9.pdf")
barchart(project_noinvert_6p9g9, K = 6, run = best_noinvert_6p9g9, border = NA, space = 0, col = c("red", "blue","green","yellow","orange","pink"), xlab = "Individuals", ylab = "Ancestry proportions") -> bp_noinvert_6p9g9
axis(1, at = 1:length(bp_noinvert_6p9g9$order), labels = bp_noinvert_6p9g9$order, las=1, cex.axis = .3)
#dev.off()
#imputing any missing data
project.missing_noinvert_6p9g9 = snmf("LFMM2/snp6p9g9_noinvert.lfmm", K = 1,
entropy = TRUE, repetitions = 10,
project = "new")
# select the run with the lowest cross-entropy value
best_noinvert_6p9g9 = which.min(cross.entropy(project.missing_noinvert_6p9g9, K = 1))
# Impute the missing genotypes
impute(project.missing_noinvert_6p9g9, "LFMM2/snp6p9g9_noinvert.lfmm",
method = 'mode', K = 1, run = best_noinvert_6p9g9)
## Missing genotype imputation for K = 1
## Missing genotype imputation for run = 5
## Results are written in the file: LFMM2/snp6p9g9_noinvert.lfmm_imputed.lfmm
## Missing genotype imputation for K = 1
## Missing genotype imputation for run = 6
## Results are written in the file: snp_.lfmm_imputed.lfmm
# Proportion of correct imputation results
dat.imp_noinvert_6p9g9 = read.lfmm("LFMM2/snp6p9g9_noinvert.lfmm_imputed.lfmm")
#mean( tutorial.R[dat == 9] == dat.imp[dat == 9] )
# Prepping the environmental data
clim.env_6pops_pca_red <- read.table("./env_data_pca_reduced", header=TRUE)
clim.env_6pops_pca_red <- clim.env_6pops_pca_red[,5:17]
colnames(clim.env_6pops_pca_red) <- NULL
row.names(clim.env_6pops_pca_red) <- NULL
clim.env_6pops_pca_red
##
## 1 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 2 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 3 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 4 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 5 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 6 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 7 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 8 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 9 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 10 17.881340 12.20 30.50 28.50 21.10 24.35 27.85 25.00 7.69 5.71 -3.6501277
## 11 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 12 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 13 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 14 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 15 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 16 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 17 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 18 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 19 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 20 8.824636 9.24 29.44 31.58 20.63 24.51 26.63 14.19 7.94 8.09 1.8262205
## 21 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 22 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 23 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 24 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 25 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 26 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 27 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 28 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 29 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 30 14.596049 11.40 31.00 29.00 20.00 24.83 28.32 22.00 7.67 5.61 -3.0116343
## 31 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 32 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 33 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 34 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 35 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 36 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 37 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 38 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 39 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 40 56.312594 11.90 30.00 27.70 21.70 24.21 28.16 26.00 7.84 6.27 -3.2867056
## 41 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 42 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 43 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 44 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 45 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 46 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 47 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 48 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 49 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 50 12.111228 6.49 25.84 33.94 21.57 26.39 21.71 9.35 7.69 7.53 7.8111068
## 51 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 52 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 53 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 54 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 55 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 56 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 57 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 58 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 59 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
## 60 59.860038 8.78 29.10 27.99 19.73 24.45 18.66 5.01 7.63 6.95 0.3111403
##
## 1 1.998418 -0.8784983
## 2 1.998418 -0.8784983
## 3 1.998418 -0.8784983
## 4 1.998418 -0.8784983
## 5 1.998418 -0.8784983
## 6 1.998418 -0.8784983
## 7 1.998418 -0.8784983
## 8 1.998418 -0.8784983
## 9 1.998418 -0.8784983
## 10 1.998418 -0.8784983
## 11 -2.447613 3.3327135
## 12 -2.447613 3.3327135
## 13 -2.447613 3.3327135
## 14 -2.447613 3.3327135
## 15 -2.447613 3.3327135
## 16 -2.447613 3.3327135
## 17 -2.447613 3.3327135
## 18 -2.447613 3.3327135
## 19 -2.447613 3.3327135
## 20 -2.447613 3.3327135
## 21 3.079209 -0.6975026
## 22 3.079209 -0.6975026
## 23 3.079209 -0.6975026
## 24 3.079209 -0.6975026
## 25 3.079209 -0.6975026
## 26 3.079209 -0.6975026
## 27 3.079209 -0.6975026
## 28 3.079209 -0.6975026
## 29 3.079209 -0.6975026
## 30 3.079209 -0.6975026
## 31 -1.122021 2.4790058
## 32 -1.122021 2.4790058
## 33 -1.122021 2.4790058
## 34 -1.122021 2.4790058
## 35 -1.122021 2.4790058
## 36 -1.122021 2.4790058
## 37 -1.122021 2.4790058
## 38 -1.122021 2.4790058
## 39 -1.122021 2.4790058
## 40 -1.122021 2.4790058
## 41 2.375912 -0.2570490
## 42 2.375912 -0.2570490
## 43 2.375912 -0.2570490
## 44 2.375912 -0.2570490
## 45 2.375912 -0.2570490
## 46 2.375912 -0.2570490
## 47 2.375912 -0.2570490
## 48 2.375912 -0.2570490
## 49 2.375912 -0.2570490
## 50 2.375912 -0.2570490
## 51 -3.883906 -3.9786694
## 52 -3.883906 -3.9786694
## 53 -3.883906 -3.9786694
## 54 -3.883906 -3.9786694
## 55 -3.883906 -3.9786694
## 56 -3.883906 -3.9786694
## 57 -3.883906 -3.9786694
## 58 -3.883906 -3.9786694
## 59 -3.883906 -3.9786694
## 60 -3.883906 -3.9786694
write.table(clim.env_6pops_pca_red, "/home/azyck/NB_capture_both/NB_ddhaplo/NB_ddhaplo_working/NB_OutlierDetection_working/clim.env_6pops_pca_red.env", sep="\t", quote=F, row.names=F)
# Prepping a second env file for downstream processing of LFMM output
clim.env_6pops_pca_red_full <- read.table("./env_data_pca_reduced", header=TRUE)
env_6pops_pca_red <- clim.env_6pops_pca_red_full[,5:17]
I am following steps documented by B.R. Forester here for running the LFMM ridge model.
# if(!requireNamespace("qvalue", quietly = TRUE)) {
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install(version = "3.14")
# BiocManager::install("qvalue")
# }
# if(!requireNamespace("lfmm", quietly = TRUE)) {
# remotes::install_github("bcm-uga/lfmm")
# }
library(vegan) # Used to run PCA & RDA
## Loading required package: permute
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:LEA':
##
## barchart
## This is vegan 2.5-7
library(lfmm) # Used to run LFMM
library(qvalue) # Used to post-process LFMM output
library(vcfR)
library(stringr)
# In terminal, make the imputed file tab-separated first: sed 's/ \+/\t/g' snp6p9g9.lfmm_imputed.lfmm > snp6p9g9.lfmm_imputed_tab.lfmm
gen6p9g9<-read.delim("LFMM2/snp6p9g9.lfmm_imputed_tab.lfmm",header = FALSE)
# Adding individual names to the rows of the matrix (10 per population, numbered _10 then _1 to _9)
row.names(gen6p9g9) <- paste0(rep(c("BAR", "BIS", "GB", "KIC", "MCD", "PVD"), each = 10), "_", rep(c(10, 1:9), times = 6))
dim(gen6p9g9)
## [1] 60 22874
# LFMM requires a complete dataframe (no NAs). I am using the snp6p9g9.lfmm_imputed_tab.lfmm file created above, with missing data imputed by snmf (method = 'random'). I will move forward with this and see what happens
# No NAs
sum(is.na(gen6p9g9))
## [1] 0
Repeat for no_invert dataset
# In terminal, make the imputed file tab-separated first: sed 's/ \+/\t/g' snp6p9g9_noinvert.lfmm_imputed.lfmm > snp6p9g9_noinvert.lfmm_imputed_tab.lfmm
gen_noinvert_6p9g9<-read.delim("LFMM2/snp6p9g9_noinvert.lfmm_imputed_tab.lfmm",header = FALSE)
# Adding individual names to the rows of the matrix (10 per population, numbered _10 then _1 to _9)
row.names(gen_noinvert_6p9g9) <- paste0(rep(c("BAR", "BIS", "GB", "KIC", "MCD", "PVD"), each = 10), "_", rep(c(10, 1:9), times = 6))
dim(gen_noinvert_6p9g9)
## [1] 60 21070
# LFMM requires a complete dataframe (no NAs). I am using the snp6p9g9_noinvert.lfmm_imputed_tab.lfmm file created above, with missing data imputed based on the most common genotype for each SNP (method = 'mode'). I will move forward with this and see what happens
# No NAs
sum(is.na(gen_noinvert_6p9g9))
## [1] 0
env6p9g9_red.lfmm <- clim.env_6pops_pca_red_full
env6p9g9_red.lfmm$Individual <- paste0(rep(c("BAR", "BIS", "GB", "KIC", "MCD", "PVD"), each = 10), "_", rep(c(10, 1:9), times = 6))
str(env6p9g9_red.lfmm)
## 'data.frame': 60 obs. of 17 variables:
## $ Individual : chr "BAR_10" "BAR_1" "BAR_2" "BAR_3" ...
## $ Population : Factor w/ 6 levels "BAR","BIS","GB",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Latitude : num 41.7 41.7 41.7 41.7 41.7 ...
## $ Longitude : num -71.3 -71.3 -71.3 -71.3 -71.3 ...
## $ SewageEffluent : num 17.9 17.9 17.9 17.9 17.9 ...
## $ Temp_Min : num 12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 12.2 ...
## $ June_Salinity_Max: num 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 ...
## $ Temp_Max : num 28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 ...
## $ July_Temp_Min : num 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 ...
## $ July_Temp_Avg : num 24.4 24.4 24.4 24.4 24.4 ...
## $ June_Salinity_Avg: num 27.9 27.9 27.9 27.9 27.9 ...
## $ June_Salinity_Min: num 25 25 25 25 25 25 25 25 25 25 ...
## $ pH_Avg : num 7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 7.69 ...
## $ June_DO_Avg : num 5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 5.71 ...
## $ PC1 : num -3.65 -3.65 -3.65 -3.65 -3.65 ...
## $ PC2 : num 2 2 2 2 2 ...
## $ PC3 : num -0.878 -0.878 -0.878 -0.878 -0.878 ...
#subsetting the environmental dataset to just include the environmental variables
pred6p9g9_red <- env_6pops_pca_red
pred6p9g9_red_noPCs <- env_6pops_pca_red[,1:10]
# Running a PCA on the environmental variables. We can use the first PC as a synthetic predictor
pred6p9g9_red.pca <- rda(pred6p9g9_red_noPCs, scale=T)
summary(pred6p9g9_red.pca)$cont
## $importance
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Eigenvalue 5.1384 2.3949 1.3488 1.0301 0.087742
## Proportion Explained 0.5138 0.2395 0.1349 0.1030 0.008774
## Cumulative Proportion 0.5138 0.7533 0.8882 0.9912 1.000000
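A quick consistency check on this output (a sketch, using the eigenvalues copied from above): with scale = TRUE the PCA runs on the correlation matrix of the 10 predictors, so each proportion explained is simply eigenvalue / 10. Only five components appear, presumably because the six populations leave the environmental matrix with rank five.

```r
# Sanity check (sketch): with scale = TRUE, proportion explained is
# eigenvalue / (number of variables). Eigenvalues copied from the output above.
eig <- c(5.1384, 2.3949, 1.3488, 1.0301, 0.087742)
round(eig / 10, 4)  # compare against the "Proportion Explained" row
```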
#plotting PCA
plot(pred6p9g9_red.pca)
screeplot(pred6p9g9_red.pca, main = "Screeplot: Eigenvalues of Oyster Predictor Variables")
## correlations between the PC axis and predictors:
round(scores(pred6p9g9_red.pca, choices=1:5, display="species", scaling=0), digits=3)
## PC1 PC2 PC3 PC4 PC5
## SewageEffluent 0.111 -0.429 0.293 -0.591 0.485
## Temp_Min 0.438 0.063 -0.046 -0.018 -0.187
## June_Salinity_Max 0.414 -0.063 0.074 0.312 0.061
## Temp_Max -0.373 0.322 -0.088 0.159 0.054
## July_Temp_Min -0.041 0.454 -0.040 -0.685 -0.389
## July_Temp_Avg -0.363 0.149 -0.419 -0.070 0.572
## June_Salinity_Avg 0.317 0.434 -0.055 0.129 0.366
## June_Salinity_Min 0.364 0.328 -0.167 -0.151 0.109
## pH_Avg 0.039 0.414 0.651 0.049 0.282
## June_DO_Avg -0.342 0.100 0.516 0.109 -0.140
## attr(,"const")
## [1] 4.92848
PC1 explains 51% of the variation, PC2 explains 24%, and PC3 explains 13%. The variables most strongly correlated with PC1 are Temp_Min, June_Salinity_Max, and Temp_Max.
I could store the synthetic PC axis predictor as pred.PC1 and do the same for PC2 and PC3. LFMM can be run using these PC scores as the predictors, or on each individual environmental predictor. I'm opting for the latter because I'd like to see which predictors are associated with outlier SNPs. I've also already saved the PC 1-3 scores for the full environmental dataset. The next section is very repetitive, as the same block of code is repeated for each predictor.
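The repeated per-predictor blocks that follow could equivalently be driven by a loop over a named list of predictors. A minimal sketch, assuming gen6p9g9 (genotypes), pred6p9g9_red (predictors), and K (chosen below via the broken-stick criterion) have already been created as elsewhere in this document:

```r
# Optional sketch: run the per-predictor LFMM workflow in a loop instead of
# repeating the block by hand for every predictor.
predictors <- list(
  SE    = pred6p9g9_red$SewageEffluent,
  Min_T = pred6p9g9_red$Temp_Min
  # ... add the remaining predictors the same way
)
lfmm_results <- lapply(predictors, function(x) {
  fit <- lfmm_ridge(Y = gen6p9g9, X = x, K = K)
  pv  <- lfmm_test(Y = gen6p9g9, X = x, lfmm = fit, calibrate = "gif")
  qv  <- qvalue(pv$calibrated.pvalue)$qvalues
  list(gif = pv$gif, outliers = which(qv < 0.1))
})
# e.g. lfmm_results$Min_T$outliers gives the FDR < 10% SNP indices for Temp_Min
```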
# Save each predictor as its own variable
# Sewage Effluent
SE <- pred6p9g9_red$SewageEffluent
# Min Temp
Min_T <- pred6p9g9_red$Temp_Min
# Max June Salinity
Max_Sal_Jun <- pred6p9g9_red$June_Salinity_Max
# Max Temp
Max_T <- pred6p9g9_red$Temp_Max
# July Temp Min
Min_T_Jul <- pred6p9g9_red$July_Temp_Min
# Avg Temp July
Avg_T_Jul <- pred6p9g9_red$July_Temp_Avg
# Avg Sal Jun
Avg_Sal_Jun <- pred6p9g9_red$June_Salinity_Avg
# Min June Salinity
Min_Sal_Jun <- pred6p9g9_red$June_Salinity_Min
# Avg pH
Avg_pH <- pred6p9g9_red$pH_Avg
# DO Avg June
Avg_DO_Jun <- pred6p9g9_red$June_DO_Avg
# PC1
PC1 <- pred6p9g9_red$PC1
# PC2
PC2 <- pred6p9g9_red$PC2
# PC3
PC3 <- pred6p9g9_red$PC3
#Using broken stick criterion to determine k - PCs should be retained as long as observed eigenvalues are higher than corresponding random broken stick components
#Testing it out on the environmental data
screeplot(pred6p9g9_red.pca, main = "Screeplot of Oyster Predictor Variables with Broken Stick", bstick=TRUE, type="barplot")
PC1 explains more variance than its random broken-stick component, while PC2 and beyond do not. If this were genomic data, and we were determining a value of K using this approach, we'd set K = 1.
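For reference, the broken-stick expectation that the red bars represent can be computed directly: with n components, the expected proportion of variance for component k is (1/n) * (1/k + 1/(k+1) + ... + 1/n); vegan's bstick() reports the same quantity on the eigenvalue scale. A small sketch with a hypothetical n = 5:

```r
# Broken-stick expected proportions of variance for n components.
# Component k's expected share is (1/n) * (1/k + 1/(k+1) + ... + 1/n).
broken_stick <- function(n) sapply(seq_len(n), function(k) sum(1 / (k:n)) / n)
round(broken_stick(5), 3)
# A component is retained while its observed proportion exceeds this expectation
# (e.g. the environmental PC1 above: 0.514 observed vs. 0.457 expected).
```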
# Now looking at the genetic data
gen6p9g9.pca <- rda(gen6p9g9, scale=T)
screeplot(gen6p9g9.pca, main = "Screeplot of Genetic Data with Broken Stick", bstick=TRUE, type="barplot")
For the genomic data, none of the PCs have eigenvalues greater than the corresponding broken-stick values (in red), which effectively means K = 1 for the genomic dataset based on this PCA assessment. I also tried K = 3 based on the scree-plot method; the higher K increased the GIF value for all predictors, while K = 1 identified one additional outlier SNP each for pH_Avg and PC2. I will therefore move forward with K = 1 for the LFMM2 analysis.
Repeating for the no-invert dataset
# Now looking at the genetic data
gen_noinvert_6p9g9.pca <- rda(gen_noinvert_6p9g9, scale=T)
screeplot(gen_noinvert_6p9g9.pca, main = "Screeplot of Genetic Data with Broken Stick", bstick=TRUE, type="barplot")
For the genomic data, we can see that none of the PCs have eigenvalues greater than random (greater than the broken stick values in red). This effectively means that K=1 for the genomic data set, based on a PCA assessment.
K <- 1
Sewage Effluent
oys6p9g9_red_SE.lfmm <- lfmm_ridge(Y=gen6p9g9, X=SE, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_SE.pv <- lfmm_test(Y=gen6p9g9, X=SE, lfmm=oys6p9g9_red_SE.lfmm, calibrate="gif")
names(oys6p9g9_red_SE.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_SE.pv$gif
## [1] 1.078058
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_SE <- oys6p9g9_red_SE.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_SE <- oys6p9g9_red_SE.pv$gif[1]) ## default GIF for this predictor
## [1] 1.078058
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_SE <- pchisq(zscore_6p9g9_red_SE^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")
hist(adj.pv6p9g9_red_SE, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_SE.qv <- qvalue(oys6p9g9_red_SE.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_SE.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_SE.FDR.1 <- which(oys6p9g9_red_SE.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_SE.FDR.1
## integer(0)
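As a sanity check (a sketch, assuming lfmm_test's "gif" calibration is exactly the chi-squared rescaling applied manually above), the calibrated p-values can be reproduced by hand from the z-scores and the reported GIF:

```r
# Recompute the GIF-calibrated p-values by hand; this should agree with
# lfmm_test()'s calibrated.pvalue if calibration is the chi-squared rescaling.
z   <- oys6p9g9_red_SE.pv$score[, 1]
gif <- oys6p9g9_red_SE.pv$gif[1]
manual.pv <- pchisq(z^2 / gif, df = 1, lower.tail = FALSE)
all.equal(as.vector(manual.pv),
          as.vector(oys6p9g9_red_SE.pv$calibrated.pvalue[, 1]))
```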
Temp Min
oys6p9g9_red_Min_T.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_T.pv <- lfmm_test(Y=gen6p9g9, X=Min_T, lfmm=oys6p9g9_red_Min_T.lfmm, calibrate="gif")
names(oys6p9g9_red_Min_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys6p9g9_red_Min_T.pv$gif
## [1] 1.071593
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_T <- oys6p9g9_red_Min_T.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_T <- oys6p9g9_red_Min_T.pv$gif[1]) ## default GIF for this predictor
## [1] 1.071593
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_T <- pchisq(zscore_6p9g9_red_Min_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")
hist(adj.pv6p9g9_red_Min_T, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_T.qv <- qvalue(oys6p9g9_red_Min_T.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Min_T.FDR.1 <- which(oys6p9g9_red_Min_T.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_T.FDR.1
## [1] 12105
invisible(lapply(oys6p9g9_red_Min_T.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_T.txt", append=TRUE))
June Salinity Max
oys6p9g9_red_Max_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Max_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Max_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Max_Sal_Jun, lfmm=oys6p9g9_red_Max_Sal_Jun.lfmm, calibrate="gif")
names(oys6p9g9_red_Max_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.11.
oys6p9g9_red_Max_Sal_Jun.pv$gif
## [1] 1.110477
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Max_Sal_Jun <- oys6p9g9_red_Max_Sal_Jun.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Max_Sal_Jun <- oys6p9g9_red_Max_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.110477
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Max_Sal_Jun <- pchisq(zscore_6p9g9_red_Max_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.11)")
hist(adj.pv6p9g9_red_Max_Sal_Jun, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Max_Sal_Jun.qv <- qvalue(oys6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Max_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Max_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Max_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Max_Sal_Jun.FDR.1
## [1] 1364 12105
invisible(lapply(oys6p9g9_red_Max_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Max_Sal_Jun.txt", append=TRUE))
Temp Max
oys6p9g9_red_Max_T.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Max_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Max_T.pv <- lfmm_test(Y=gen6p9g9, X=Max_T, lfmm=oys6p9g9_red_Max_T.lfmm, calibrate="gif")
names(oys6p9g9_red_Max_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Max_T.pv$gif
## [1] 1.082925
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Max_T <- oys6p9g9_red_Max_T.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Max_T <- oys6p9g9_red_Max_T.pv$gif[1]) ## default GIF for this predictor
## [1] 1.082925
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Max_T <- pchisq(zscore_6p9g9_red_Max_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")
hist(adj.pv6p9g9_red_Max_T, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Max_T.qv <- qvalue(oys6p9g9_red_Max_T.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Max_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Max_T.FDR.1 <- which(oys6p9g9_red_Max_T.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Max_T.FDR.1
## [1] 20651
invisible(lapply(oys6p9g9_red_Max_T.FDR.1, write, "outliers_lfmm2_6p9g9_red_Max_T.txt", append=TRUE))
July Temp Min
oys6p9g9_red_Min_T_Jul.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_T_Jul.pv <- lfmm_test(Y=gen6p9g9, X=Min_T_Jul, lfmm=oys6p9g9_red_Min_T_Jul.lfmm, calibrate="gif")
names(oys6p9g9_red_Min_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.06, which I'm happy with.
oys6p9g9_red_Min_T_Jul.pv$gif
## [1] 1.055777
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_T_Jul <- oys6p9g9_red_Min_T_Jul.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_T_Jul <- oys6p9g9_red_Min_T_Jul.pv$gif[1]) ## default GIF for this predictor
## [1] 1.055777
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_T_Jul <- pchisq(zscore_6p9g9_red_Min_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.06)")
hist(adj.pv6p9g9_red_Min_T_Jul, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_T_Jul.qv <- qvalue(oys6p9g9_red_Min_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Min_T_Jul.FDR.1 <- which(oys6p9g9_red_Min_T_Jul.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_T_Jul.FDR.1
## [1] 12372 19400
invisible(lapply(oys6p9g9_red_Min_T_Jul.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_T_Jul.txt", append=TRUE))
July Temp Avg
oys6p9g9_red_Avg_T_Jul.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_T_Jul.pv <- lfmm_test(Y=gen6p9g9, X=Avg_T_Jul, lfmm=oys6p9g9_red_Avg_T_Jul.lfmm, calibrate="gif")
names(oys6p9g9_red_Avg_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13.
oys6p9g9_red_Avg_T_Jul.pv$gif
## [1] 1.131646
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_T_Jul <- oys6p9g9_red_Avg_T_Jul.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_T_Jul <- oys6p9g9_red_Avg_T_Jul.pv$gif[1]) ## default GIF for this predictor
## [1] 1.131646
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_T_Jul <- pchisq(zscore_6p9g9_red_Avg_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")
hist(adj.pv6p9g9_red_Avg_T_Jul, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_T_Jul.qv <- qvalue(oys6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 3
oys6p9g9_red_Avg_T_Jul.FDR.1 <- which(oys6p9g9_red_Avg_T_Jul.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_T_Jul.FDR.1
## [1] 13926 13927 20651
invisible(lapply(oys6p9g9_red_Avg_T_Jul.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_T_Jul.txt", append=TRUE))
June Salinity Avg
oys6p9g9_red_Avg_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Avg_Sal_Jun, lfmm=oys6p9g9_red_Avg_Sal_Jun.lfmm, calibrate="gif")
names(oys6p9g9_red_Avg_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Avg_Sal_Jun.pv$gif
## [1] 1.077006
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_Sal_Jun <- oys6p9g9_red_Avg_Sal_Jun.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_Sal_Jun <- oys6p9g9_red_Avg_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.077006
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_Sal_Jun <- pchisq(zscore_6p9g9_red_Avg_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")
hist(adj.pv6p9g9_red_Avg_Sal_Jun, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_Sal_Jun.qv <- qvalue(oys6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_Avg_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Avg_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_Sal_Jun.FDR.1
## [1] 10757 17899
invisible(lapply(oys6p9g9_red_Avg_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_Sal_Jun.txt", append=TRUE))
June Salinity Min
oys6p9g9_red_Min_Sal_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Min_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Min_Sal_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Min_Sal_Jun, lfmm=oys6p9g9_red_Min_Sal_Jun.lfmm, calibrate="gif")
names(oys6p9g9_red_Min_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08, which I'm happy with.
oys6p9g9_red_Min_Sal_Jun.pv$gif
## [1] 1.077292
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Min_Sal_Jun <- oys6p9g9_red_Min_Sal_Jun.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Min_Sal_Jun <- oys6p9g9_red_Min_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.077292
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Min_Sal_Jun <- pchisq(zscore_6p9g9_red_Min_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")
hist(adj.pv6p9g9_red_Min_Sal_Jun, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Min_Sal_Jun.qv <- qvalue(oys6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Min_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Min_Sal_Jun.FDR.1 <- which(oys6p9g9_red_Min_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Min_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys6p9g9_red_Min_Sal_Jun.FDR.1, write, "outliers_lfmm2_6p9g9_red_Min_Sal_Jun.txt", append=TRUE))
pH Avg
oys6p9g9_red_Avg_pH.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_pH, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_pH.pv <- lfmm_test(Y=gen6p9g9, X=Avg_pH, lfmm=oys6p9g9_red_Avg_pH.lfmm, calibrate="gif")
names(oys6p9g9_red_Avg_pH.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys6p9g9_red_Avg_pH.pv$gif
## [1] 1.068287
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_pH <- oys6p9g9_red_Avg_pH.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_pH <- oys6p9g9_red_Avg_pH.pv$gif[1]) ## default GIF for this predictor
## [1] 1.068287
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_pH <- pchisq(zscore_6p9g9_red_Avg_pH^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")
hist(adj.pv6p9g9_red_Avg_pH, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_pH.qv <- qvalue(oys6p9g9_red_Avg_pH.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_pH.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys6p9g9_red_Avg_pH.FDR.1 <- which(oys6p9g9_red_Avg_pH.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_pH.FDR.1
## [1] 16988
invisible(lapply(oys6p9g9_red_Avg_pH.FDR.1, write, "outliers_lfmm2_6p9g9_red_Avg_pH.txt", append=TRUE))
June DO Avg
oys6p9g9_red_Avg_DO_Jun.lfmm <- lfmm_ridge(Y=gen6p9g9, X=Avg_DO_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_Avg_DO_Jun.pv <- lfmm_test(Y=gen6p9g9, X=Avg_DO_Jun, lfmm=oys6p9g9_red_Avg_DO_Jun.lfmm, calibrate="gif")
names(oys6p9g9_red_Avg_DO_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.12, which I'm happy with.
oys6p9g9_red_Avg_DO_Jun.pv$gif
## [1] 1.118573
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above one indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below one indicates the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_Avg_DO_Jun <- oys6p9g9_red_Avg_DO_Jun.pv$score[,1] # zscores for first predictor, we only have one in our case...
(gif_6p9g9_red_Avg_DO_Jun <- oys6p9g9_red_Avg_DO_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.118573
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_Avg_DO_Jun <- pchisq(zscore_6p9g9_red_Avg_DO_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new gif
hist(oys6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.12)")
hist(adj.pv6p9g9_red_Avg_DO_Jun, main="Readjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys6p9g9_red_Avg_DO_Jun.qv <- qvalue(oys6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_Avg_DO_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_Avg_DO_Jun.FDR.1 <- which(oys6p9g9_red_Avg_DO_Jun.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_Avg_DO_Jun.FDR.1
## integer(0)
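Note that the q-value step above uses the default GIF-calibrated p-values. If you instead want the FDR step to reflect the manually readjusted p-values (GIF = 1.00), the same `qvalue` call applies; a sketch using the objects defined above:

```r
# Hypothetical alternative: q-values computed from the manually readjusted
# p-values (adj.pv..., GIF = 1.00) rather than the default GIF calibration
adj.qv6p9g9_red_Avg_DO_Jun <- qvalue(adj.pv6p9g9_red_Avg_DO_Jun)$qvalues
which(adj.qv6p9g9_red_Avg_DO_Jun < 0.1)  # candidate SNPs at FDR < 10%
```

A smaller GIF shrinks the test statistics less, so this alternative will generally flag the same or more candidates than the default calibration.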
PC1
oys6p9g9_red_PC1.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC1, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC1.pv <- lfmm_test(Y=gen6p9g9, X=PC1, lfmm=oys6p9g9_red_PC1.lfmm, calibrate="gif")
names(oys6p9g9_red_PC1.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys6p9g9_red_PC1.pv$gif
## [1] 1.087458
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC1 <- oys6p9g9_red_PC1.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif_6p9g9_red_PC1 <- oys6p9g9_red_PC1.pv$gif[1]) ## default GIF for this predictor
## [1] 1.087458
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC1 <- pchisq(zscore_6p9g9_red_PC1^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")
hist(adj.pv6p9g9_red_PC1, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys6p9g9_red_PC1.qv <- qvalue(oys6p9g9_red_PC1.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC1.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys6p9g9_red_PC1.FDR.1 <- which(oys6p9g9_red_PC1.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC1.FDR.1
## [1] 12105 20651
invisible(lapply(oys6p9g9_red_PC1.FDR.1, write, "outliers_lfmm2_6p9g9_red_PC1.txt", append=TRUE))
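Because each predictor writes its own outlier file, the per-predictor lists can later be merged into a single candidate set. A minimal sketch, assuming the `outliers_lfmm2_*.txt` files written by this workflow are in the working directory:

```r
# Hypothetical: take the union of all per-predictor LFMM2 outlier files
files <- list.files(pattern = "^outliers_lfmm2_.*\\.txt$")
all_outliers <- sort(unique(as.integer(unlist(lapply(files, readLines)))))
all_outliers  # combined candidate SNP indices across predictors
```

Keeping the indices as integers lets them be matched back against the SNP positions in the original VCF later on.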
PC2
oys6p9g9_red_PC2.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC2, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC2.pv <- lfmm_test(Y=gen6p9g9, X=PC2, lfmm=oys6p9g9_red_PC2.lfmm, calibrate="gif")
names(oys6p9g9_red_PC2.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.05, which I'm happy with.
oys6p9g9_red_PC2.pv$gif
## [1] 1.052841
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC2 <- oys6p9g9_red_PC2.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif_6p9g9_red_PC2 <- oys6p9g9_red_PC2.pv$gif[1]) ## default GIF for this predictor
## [1] 1.052841
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC2 <- pchisq(zscore_6p9g9_red_PC2^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.05)")
hist(adj.pv6p9g9_red_PC2, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys6p9g9_red_PC2.qv <- qvalue(oys6p9g9_red_PC2.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC2.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 4
oys6p9g9_red_PC2.FDR.1 <- which(oys6p9g9_red_PC2.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC2.FDR.1
## [1] 4960 5947 18936 19091
invisible(lapply(oys6p9g9_red_PC2.FDR.1, write, "outliers_lfmm2_6p9g9_red_PC2.txt", append=TRUE))
PC3
oys6p9g9_red_PC3.lfmm <- lfmm_ridge(Y=gen6p9g9, X=PC3, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys6p9g9_red_PC3.pv <- lfmm_test(Y=gen6p9g9, X=PC3, lfmm=oys6p9g9_red_PC3.lfmm, calibrate="gif")
names(oys6p9g9_red_PC3.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.02, which I'm happy with.
oys6p9g9_red_PC3.pv$gif
## [1] 1.022614
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore_6p9g9_red_PC3 <- oys6p9g9_red_PC3.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif_6p9g9_red_PC3 <- oys6p9g9_red_PC3.pv$gif[1]) ## default GIF for this predictor
## [1] 1.022614
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv6p9g9_red_PC3 <- pchisq(zscore_6p9g9_red_PC3^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.02)")
hist(adj.pv6p9g9_red_PC3, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys6p9g9_red_PC3.qv <- qvalue(oys6p9g9_red_PC3.pv$calibrated.pvalue)$qvalues
length(which(oys6p9g9_red_PC3.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys6p9g9_red_PC3.FDR.1 <- which(oys6p9g9_red_PC3.qv < 0.1) ## identify which SNPs these are
oys6p9g9_red_PC3.FDR.1
## integer(0)
K <- 1 ## reset the number of latent factors for the gen_noinvert dataset
Sewage Effluent
oys_noinvert_6p9g9_red_SE.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=SE, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_SE.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=SE, lfmm=oys_noinvert_6p9g9_red_SE.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_SE.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys_noinvert_6p9g9_red_SE.pv$gif
## [1] 1.09136
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_SE <- oys_noinvert_6p9g9_red_SE.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_SE <- oys_noinvert_6p9g9_red_SE.pv$gif[1]) ## default GIF for this predictor
## [1] 1.09136
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_SE <- pchisq(zscore__noinvert_6p9g9_red_SE^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_SE.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")
hist(adj.pv_noinvert_6p9g9_red_SE, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_SE.qv <- qvalue(oys_noinvert_6p9g9_red_SE.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_SE.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_SE.FDR.1 <- which(oys_noinvert_6p9g9_red_SE.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_SE.FDR.1
## integer(0)
Temp Min
oys_noinvert_6p9g9_red_Min_T.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_T.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_T, lfmm=oys_noinvert_6p9g9_red_Min_T.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Min_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.08.
oys_noinvert_6p9g9_red_Min_T.pv$gif
## [1] 1.084172
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Min_T <- oys_noinvert_6p9g9_red_Min_T.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Min_T <- oys_noinvert_6p9g9_red_Min_T.pv$gif[1]) ## default GIF for this predictor
## [1] 1.084172
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_T <- pchisq(zscore__noinvert_6p9g9_red_Min_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Min_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.08)")
hist(adj.pv_noinvert_6p9g9_red_Min_T, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Min_T.qv <- qvalue(oys_noinvert_6p9g9_red_Min_T.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Min_T.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_T.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_T.FDR.1
## [1] 12105
invisible(lapply(oys_noinvert_6p9g9_red_Min_T.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_T.txt", append=TRUE))
June Salinity Max
oys_noinvert_6p9g9_red_Max_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Max_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Max_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Max_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Max_Sal_Jun.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13.
oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$gif
## [1] 1.128308
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Max_Sal_Jun <- oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Max_Sal_Jun <- oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.128308
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Max_Sal_Jun <- pchisq(zscore__noinvert_6p9g9_red_Max_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")
hist(adj.pv_noinvert_6p9g9_red_Max_Sal_Jun, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Max_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Max_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Max_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Max_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1
## [1] 1364 12105
invisible(lapply(oys_noinvert_6p9g9_red_Max_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Max_Sal_Jun.txt", append=TRUE))
Temp Max
oys_noinvert_6p9g9_red_Max_T.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Max_T, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Max_T.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Max_T, lfmm=oys_noinvert_6p9g9_red_Max_T.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Max_T.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.09, which I'm happy with.
oys_noinvert_6p9g9_red_Max_T.pv$gif
## [1] 1.088141
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Max_T <- oys_noinvert_6p9g9_red_Max_T.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Max_T <- oys_noinvert_6p9g9_red_Max_T.pv$gif[1]) ## default GIF for this predictor
## [1] 1.088141
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Max_T <- pchisq(zscore__noinvert_6p9g9_red_Max_T^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Max_T.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.09)")
hist(adj.pv_noinvert_6p9g9_red_Max_T, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Max_T.qv <- qvalue(oys_noinvert_6p9g9_red_Max_T.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Max_T.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Max_T.FDR.1 <- which(oys_noinvert_6p9g9_red_Max_T.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Max_T.FDR.1
## [1] 18847
invisible(lapply(oys_noinvert_6p9g9_red_Max_T.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Max_T.txt", append=TRUE))
July Temp Min
oys_noinvert_6p9g9_red_Min_T_Jul.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_T_Jul.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_T_Jul, lfmm=oys_noinvert_6p9g9_red_Min_T_Jul.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Min_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Min_T_Jul.pv$gif
## [1] 1.07285
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Min_T_Jul <- oys_noinvert_6p9g9_red_Min_T_Jul.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Min_T_Jul <- oys_noinvert_6p9g9_red_Min_T_Jul.pv$gif[1]) ## default GIF for this predictor
## [1] 1.07285
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_T_Jul <- pchisq(zscore__noinvert_6p9g9_red_Min_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")
hist(adj.pv_noinvert_6p9g9_red_Min_T_Jul, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Min_T_Jul.qv <- qvalue(oys_noinvert_6p9g9_red_Min_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_T_Jul.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1
## [1] 12372 17596
invisible(lapply(oys_noinvert_6p9g9_red_Min_T_Jul.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_T_Jul.txt", append=TRUE))
July Temp Avg
oys_noinvert_6p9g9_red_Avg_T_Jul.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_T_Jul, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_T_Jul.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_T_Jul, lfmm=oys_noinvert_6p9g9_red_Avg_T_Jul.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Avg_T_Jul.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.13.
oys_noinvert_6p9g9_red_Avg_T_Jul.pv$gif
## [1] 1.134126
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_T_Jul <- oys_noinvert_6p9g9_red_Avg_T_Jul.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Avg_T_Jul <- oys_noinvert_6p9g9_red_Avg_T_Jul.pv$gif[1]) ## default GIF for this predictor
## [1] 1.134126
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_T_Jul <- pchisq(zscore__noinvert_6p9g9_red_Avg_T_Jul^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.13)")
hist(adj.pv_noinvert_6p9g9_red_Avg_T_Jul, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Avg_T_Jul.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_T_Jul.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_T_Jul.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 3
oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_T_Jul.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1
## [1] 13926 13927 18847
invisible(lapply(oys_noinvert_6p9g9_red_Avg_T_Jul.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_T_Jul.txt", append=TRUE))
June Salinity Avg
oys_noinvert_6p9g9_red_Avg_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Avg_Sal_Jun.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$gif
## [1] 1.070078
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_Sal_Jun <- oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Avg_Sal_Jun <- oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.070078
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_Sal_Jun <- pchisq(zscore__noinvert_6p9g9_red_Avg_Sal_Jun^2/new.gif, df=1, lower = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")
hist(adj.pv_noinvert_6p9g9_red_Avg_Sal_Jun, main="Readjusted p-values (GIF=1.00)")
# Convert the GIF-calibrated p-values to q-values. Q-values measure each SNP's significance while accounting for the thousands of tests run simultaneously;
# an FDR threshold can then be used to control the number of false-positive detections.
oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys_noinvert_6p9g9_red_Avg_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_Sal_Jun.txt", append=TRUE))
June Salinity Min
oys_noinvert_6p9g9_red_Min_Sal_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Min_Sal_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Min_Sal_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Min_Sal_Jun, lfmm=oys_noinvert_6p9g9_red_Min_Sal_Jun.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.07, which I'm happy with.
oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$gif
## [1] 1.069792
An appropriately calibrated set of tests has a GIF of around 1. A GIF above 1 suggests the results may be overly liberal in identifying candidate SNPs; a GIF below 1 suggests the test may be too conservative.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Min_Sal_Jun <- oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Min_Sal_Jun <- oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.069792
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Min_Sal_Jun <- pchisq(zscore__noinvert_6p9g9_red_Min_Sal_Jun^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.07)")
hist(adj.pv_noinvert_6p9g9_red_Min_Sal_Jun, main="Re-adjusted p-values (GIF=1.00)")
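The manual readjustment above is just a one-degree-of-freedom chi-squared test on the GIF-scaled squared z-scores. As a stand-alone illustration of the arithmetic (the z-scores below are made up, not from this data set), the same computation in plain Python uses the identity P(chi2_1 > x) = erfc(sqrt(x/2)):

```python
import math

def gif_adjusted_pvalue(z, gif=1.0):
    """P-value for a z-score after scaling its squared value by the GIF.
    Mirrors pchisq(z^2/gif, df=1, lower.tail=FALSE) in R,
    via the identity P(chi2_1 > x) = erfc(sqrt(x/2))."""
    x = z * z / gif
    return math.erfc(math.sqrt(x / 2.0))

# Illustrative z-scores (not the real LFMM2 scores)
zscores = [0.5, 1.96, 3.3]
raw = [gif_adjusted_pvalue(z, gif=1.0) for z in zscores]
adj = [gif_adjusted_pvalue(z, gif=1.07) for z in zscores]
# Dividing by a GIF > 1 shrinks the test statistic, so p-values grow
assert all(a > r for a, r in zip(adj, raw))
```

This is why a GIF above 1 makes the calibrated tests more conservative than the raw ones, and why lowering the GIF back toward 1 makes them more liberal.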
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Min_Sal_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Min_Sal_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Min_Sal_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 1
oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Min_Sal_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1
## [1] 10757
invisible(lapply(oys_noinvert_6p9g9_red_Min_Sal_Jun.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Min_Sal_Jun.txt", append=TRUE))
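The qvalue step above uses Storey's method, which is not reproduced here; as a rough stand-alone sketch of the same FDR idea, here is the closely related Benjamini-Hochberg adjustment in plain Python (toy p-values, not the real data):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (analogous in spirit to
    q-values; the R qvalue package uses Storey's method instead)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adj = [0.0] * n
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * n / rank)
        adj[i] = prev
    return adj

# Toy example: one strong signal among mostly null p-values
adj = benjamini_hochberg([0.0001, 0.2, 0.5, 0.8, 0.9])
assert adj[0] < 0.1                    # the strong signal survives FDR < 10%
assert all(a >= 0.1 for a in adj[1:])  # the rest do not
```

The FDR-threshold step then reduces to `which(adj < 0.1)`, exactly as with the q-values above.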
pH Avg
oys_noinvert_6p9g9_red_Avg_pH.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_pH, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_pH.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_pH, lfmm=oys_noinvert_6p9g9_red_Avg_pH.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Avg_pH.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.05, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_pH.pv$gif
## [1] 1.054805
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above 1 indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below 1 indicates the test may be too conservative. Here the GIF is only slightly elevated.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_pH <- oys_noinvert_6p9g9_red_Avg_pH.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Avg_pH <- oys_noinvert_6p9g9_red_Avg_pH.pv$gif[1]) ## default GIF for this predictor
## [1] 1.054805
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_pH <- pchisq(zscore__noinvert_6p9g9_red_Avg_pH^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.05)")
hist(adj.pv_noinvert_6p9g9_red_Avg_pH, main="Re-adjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_pH.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_pH.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_pH.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_Avg_pH.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_pH.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_pH.FDR.1
## integer(0)
# invisible(lapply(oys_noinvert_6p9g9_red_Avg_pH.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_Avg_pH.txt", append=TRUE))
June DO Avg
oys_noinvert_6p9g9_red_Avg_DO_Jun.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=Avg_DO_Jun, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_Avg_DO_Jun.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=Avg_DO_Jun, lfmm=oys_noinvert_6p9g9_red_Avg_DO_Jun.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.10, which I'm happy with.
oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$gif
## [1] 1.099362
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above 1 indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below 1 indicates the test may be too conservative. Here the GIF is only slightly elevated.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_Avg_DO_Jun <- oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_Avg_DO_Jun <- oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$gif[1]) ## default GIF for this predictor
## [1] 1.099362
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_Avg_DO_Jun <- pchisq(zscore__noinvert_6p9g9_red_Avg_DO_Jun^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.10)")
hist(adj.pv_noinvert_6p9g9_red_Avg_DO_Jun, main="Re-adjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_Avg_DO_Jun.qv <- qvalue(oys_noinvert_6p9g9_red_Avg_DO_Jun.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_Avg_DO_Jun.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_Avg_DO_Jun.FDR.1 <- which(oys_noinvert_6p9g9_red_Avg_DO_Jun.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_Avg_DO_Jun.FDR.1
## integer(0)
PC1
oys_noinvert_6p9g9_red_PC1.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC1, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC1.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC1, lfmm=oys_noinvert_6p9g9_red_PC1.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_PC1.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.10, which I'm happy with.
oys_noinvert_6p9g9_red_PC1.pv$gif
## [1] 1.104819
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above 1 indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below 1 indicates the test may be too conservative. Here the GIF is only slightly elevated.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC1 <- oys_noinvert_6p9g9_red_PC1.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_PC1 <- oys_noinvert_6p9g9_red_PC1.pv$gif[1]) ## default GIF for this predictor
## [1] 1.104819
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC1 <- pchisq(zscore__noinvert_6p9g9_red_PC1^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_PC1.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.10)")
hist(adj.pv_noinvert_6p9g9_red_PC1, main="Re-adjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC1.qv <- qvalue(oys_noinvert_6p9g9_red_PC1.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC1.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 2
oys_noinvert_6p9g9_red_PC1.FDR.1 <- which(oys_noinvert_6p9g9_red_PC1.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC1.FDR.1
## [1] 12105 18847
invisible(lapply(oys_noinvert_6p9g9_red_PC1.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_PC1.txt", append=TRUE))
PC2
oys_noinvert_6p9g9_red_PC2.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC2, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC2.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC2, lfmm=oys_noinvert_6p9g9_red_PC2.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_PC2.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.06, which I'm happy with.
oys_noinvert_6p9g9_red_PC2.pv$gif
## [1] 1.059911
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above 1 indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below 1 indicates the test may be too conservative. Here the GIF is only slightly elevated.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC2 <- oys_noinvert_6p9g9_red_PC2.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_PC2 <- oys_noinvert_6p9g9_red_PC2.pv$gif[1]) ## default GIF for this predictor
## [1] 1.059911
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC2 <- pchisq(zscore__noinvert_6p9g9_red_PC2^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_PC2.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.06)")
hist(adj.pv_noinvert_6p9g9_red_PC2, main="Re-adjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC2.qv <- qvalue(oys_noinvert_6p9g9_red_PC2.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC2.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_PC2.FDR.1 <- which(oys_noinvert_6p9g9_red_PC2.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC2.FDR.1
## integer(0)
# invisible(lapply(oys_noinvert_6p9g9_red_PC2.FDR.1, write, "outliers_lfmm2_noinvert_6p9g9_red_PC2.txt", append=TRUE))
PC3
oys_noinvert_6p9g9_red_PC3.lfmm <- lfmm_ridge(Y=gen_noinvert_6p9g9, X=PC3, K=K) ## change K as you see fit
#calculating test statistics for the predictor
oys_noinvert_6p9g9_red_PC3.pv <- lfmm_test(Y=gen_noinvert_6p9g9, X=PC3, lfmm=oys_noinvert_6p9g9_red_PC3.lfmm, calibrate="gif")
names(oys_noinvert_6p9g9_red_PC3.pv) # this object includes raw z-scores and p-values, as well as GIF-calibrated scores and p-values
## [1] "B" "epsilon.sigma2" "B.sigma2"
## [4] "score" "pvalue" "gif"
## [7] "calibrated.score2" "calibrated.pvalue"
#Looking at the genomic inflation factor (GIF) - a value around 1 means the test(s) is appropriately calibrated. Here it is 1.03, which I'm happy with.
oys_noinvert_6p9g9_red_PC3.pv$gif
## [1] 1.0309
An appropriately calibrated set of tests will have a GIF of around 1. A GIF above 1 indicates the results may be overly liberal in identifying candidate SNPs, while a GIF below 1 indicates the test may be too conservative. Here the GIF is only slightly elevated.
# look at how application of the GIF to the p-values impacts the p-value distribution:
hist(oys_noinvert_6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values")
There should be a relatively flat histogram (most loci not under selection) with a peak near zero, indicative of candidate adaptive markers. The peak around 0 is more apparent in the unadjusted p-value histogram.
# Let's change the GIF and readjust the p-values:
zscore__noinvert_6p9g9_red_PC3 <- oys_noinvert_6p9g9_red_PC3.pv$score[,1] # z-scores for the first predictor (we only have one here)
(gif__noinvert_6p9g9_red_PC3 <- oys_noinvert_6p9g9_red_PC3.pv$gif[1]) ## default GIF for this predictor
## [1] 1.0309
new.gif <- 1.00 ## choose your new GIF
# Manual adjustment of the p-values:
adj.pv_noinvert_6p9g9_red_PC3 <- pchisq(zscore__noinvert_6p9g9_red_PC3^2/new.gif, df=1, lower.tail = FALSE)
# plot the p-value histograms with the new GIF
hist(oys_noinvert_6p9g9_red_PC3.pv$pvalue[,1], main="Unadjusted p-values")
hist(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue[,1], main="GIF-adjusted p-values (GIF=1.03)")
hist(adj.pv_noinvert_6p9g9_red_PC3, main="Re-adjusted p-values (GIF=1.00)")
#convert adjusted p-values to q values - q-values provide a measure of each SNP’s significance, automatically taking into account the fact that thousands are simultaneously being tested
# then an FDR threshold can be used to control the number of false positive detections
oys_noinvert_6p9g9_red_PC3.qv <- qvalue(oys_noinvert_6p9g9_red_PC3.pv$calibrated.pvalue)$qvalues
length(which(oys_noinvert_6p9g9_red_PC3.qv < 0.1)) ## how many SNPs have an FDR < 10%?
## [1] 0
oys_noinvert_6p9g9_red_PC3.FDR.1 <- which(oys_noinvert_6p9g9_red_PC3.qv < 0.1) ## identify which SNPs these are
oys_noinvert_6p9g9_red_PC3.FDR.1
## integer(0)
In terminal:
This will save only unique outliers (no duplicates)
$ cat outliers_lfmm2* | sort | uniq > red_combined_unique_outliers_lfmm2_6p9g9.txt
To view:
$ cat red_combined_unique_outliers_lfmm2_6p9g9.txt
output:
10757
12105
12372
1364
13926
13927
16988
17899
18936
19091
19400
20651
4960
5947
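One note on the combined list above: plain `sort` orders the indices lexicographically, which is why 1364 appears after 12372 and 4960 after 20651. The deduplication is unaffected, but if numeric ordering matters downstream, `sort -n` produces it. A minimal demonstration with toy indices (written to /tmp):

```shell
# sort is lexicographic by default; -n sorts numerically
printf '12372\n1364\n12372\n10757\n' > /tmp/demo_idx.txt
sort -n /tmp/demo_idx.txt | uniq
```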
no invert outliers
$ cat outliers_lfmm2_noinvert* | sort | uniq > red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt
To view:
$ cat red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt
output:
10757
12105
12372
1364
13926
13927
17596
18847
Save outliers to new file
# Convert SNP indices to locus position and chromosome info
$ mawk '!/#/' SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf | cut -f1,2 > totalloci_6p9g9
$ NUM=(`cat totalloci_6p9g9 | wc -l`)
$ paste <(seq 1 $NUM) totalloci_6p9g9 > loci_6p9g9.plus.index
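The `parallel`/`grep` lookups below work, but an equivalent single-pass alternative is an `awk` join of the outlier indices against the index file. A toy sketch (the /tmp file names are hypothetical stand-ins for the real outlier and index files):

```shell
# Toy stand-ins for an outlier-index list and the loci index table
printf '2\n4\n' > /tmp/demo_outliers.txt
printf '1\tchrA\t100\n2\tchrA\t250\n3\tchrB\t80\n4\tchrB\t975\n' > /tmp/demo_loci.plus.index
# First file: record wanted indices; second file: print CHROM and POS for matches
awk 'NR==FNR {want[$1]; next} ($1 in want) {print $2 "\t" $3}' \
    /tmp/demo_outliers.txt /tmp/demo_loci.plus.index
```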
# pcadapt outliers
$ cat outliers_pcadapt_thinned500.txt | parallel "grep -w ^{} loci_6p9g9.plus.index" | cut -f2,3 > outlier_pcadapt_thinned500.loci.txt
# LFMM2_red outliers
$ cat red_combined_unique_outliers_lfmm2_6p9g9.txt | parallel "grep -w ^{} loci_6p9g9.plus.index" | cut -f2,3 > outlier_lfmm2_6p9g9_red.loci.txt
$ cat outlier_lfmm2_6p9g9_red.loci.txt
output:
NC_035783.1 10105379
NC_035783.1 35005081
NC_035783.1 42294461
NC_035780.1 24512990
NC_035784.1 8396484
NC_035784.1 8396493
NC_035784.1 65267874 #inside inversion
NC_035784.1 79541695 #inside inversion
NC_035785.1 36953129 #inside inversion
NC_035785.1 41726211 #inside inversion
NC_035786.1 15925374
NC_035787.1 34532963
NC_035781.1 31586020
NC_035781.1 49023293
The four SNPs within the inversion have associations with Avg_pH, Avg_Sal_Jun, and PC2.
Repeating for vcf with no invert
# Convert SNP indices to locus position and chromosome info
$ mawk '!/#/' SNP.TRS6newdp10mafp9g9nDNAmaf052A_noinvert.recode.vcf | cut -f1,2 > totalloci_6p9g9_noinvert
$ NUM=(`cat totalloci_6p9g9_noinvert | wc -l`)
$ paste <(seq 1 $NUM) totalloci_6p9g9_noinvert > loci_6p9g9_noinvert.plus.index
# LFMM2_red_noinvert outliers
$ cat red_combined_unique_outliers_lfmm2_no_invert_6p9g9.txt | parallel "grep -w ^{} loci_6p9g9_noinvert.plus.index" | cut -f2,3 > outlier_lfmm2_6p9g9_red_noinvert.loci.txt
$ cat outlier_lfmm2_6p9g9_red_noinvert.loci.txt
output:
NC_035783.1 10105379
NC_035783.1 35005081
NC_035783.1 42294461
NC_035780.1 24512990
NC_035784.1 8396484
NC_035784.1 8396493
NC_035786.1 15925374
NC_035787.1 34532963
All 8 of these SNPs are also outliers in the full SNP data set, which had 6 additional outliers: 4 of those fall within the inversions, and the remaining 2 are associated with PC2.
Combine the pcadapt and LFMM2 outlier loci and count the unique total:
$ cat outlier_pcadapt_thinned500.loci.txt outlier_lfmm2_6p9g9_red.loci.txt > all_6p9g9_red.outliers
$ cat all_6p9g9_red.outliers | sort | uniq | wc -l
output:
70
There are two SNPs shared between pcadapt and LFMM2, both located in inversions:
NC_035784.1 65267874
NC_035785.1 41726211
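Shared loci like these can be listed directly with `sort` plus `uniq -d` (print duplicated lines only), assuming each file lists each locus at most once. A toy demonstration:

```shell
# Loci appearing in both files occur twice in the sorted concatenation
printf 'chrA 10\nchrB 20\n' > /tmp/demo_set1.txt
printf 'chrB 20\nchrC 30\n' > /tmp/demo_set2.txt
sort /tmp/demo_set1.txt /tmp/demo_set2.txt | uniq -d
```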
Create VCF file with just neutral loci
$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --exclude-positions all_6p9g9_red.outliers --recode-INFO-all --out neutralloci6p9g9_red --recode
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
After filtering, kept 22804 out of a possible 22874 Sites
Run Time = 6.00 seconds
22804 neutral loci in neutralloci6p9g9_red.recode.vcf.
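The site counts that vcftools reports can be sanity-checked by counting non-header lines in the output VCF. A self-contained sketch with a tiny stand-in VCF:

```shell
# Data lines in a VCF are the ones not starting with '#'
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchrA\t100\nchrA\t250\n' > /tmp/demo.vcf
awk '!/^#/' /tmp/demo.vcf | wc -l
```

Note the anchored pattern `!/^#/` skips header lines only; the unanchored `!/#/` used with mawk above behaves the same on typical VCFs, since '#' rarely appears in data lines.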
I’m going to make a second neutral loci vcf file with large inversion regions removed as they drive the population structure.
$ vcftools --vcf neutralloci6p9g9_red.recode.vcf --exclude-bed Detected_large_inversions.bed --recode-INFO-all --out neutralloci6p9g9_red_noinversion --recode
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
Read 3 BED file entries.
After filtering, kept 21038 out of a possible 22804 Sites
Run Time = 6.00 seconds
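The `--exclude-bed` filtering above reduces to interval membership on (CHROM, POS). A rough Python sketch of that logic, keeping in mind that BED intervals are 0-based half-open while VCF POS is 1-based (the inversion coordinates below are made-up illustrations, not the contents of Detected_large_inversions.bed):

```python
def in_bed(chrom, pos, intervals):
    """True if a 1-based VCF position falls inside any
    0-based half-open BED interval (chrom, start, end)."""
    return any(c == chrom and start < pos <= end
               for c, start, end in intervals)

# Hypothetical inversion intervals for illustration only
inversions = [("NC_035784.1", 60000000, 80000000),
              ("NC_035785.1", 30000000, 45000000)]

assert in_bed("NC_035784.1", 65267874, inversions)       # inside an interval
assert not in_bed("NC_035786.1", 15925374, inversions)   # different chromosome
```

`--exclude-bed` keeps the sites for which this test is false; `--bed` keeps the ones for which it is true.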
Create VCF file with just outlier loci
$ vcftools --vcf SNP.TRS6newdp10mafp9g9nDNAmaf052A.recode.vcf --recode --recode-INFO-all --positions all_6p9g9_red.outliers --out outlierloci6p9g9_red
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
After filtering, kept 70 out of a possible 22874 Sites
Run Time = 0.00 seconds
70 outlier loci in outlierloci6p9g9_red.recode.vcf.
I’m also creating a vcf file of the outliers unique to pcadapt (56 SNPs), i.e. excluding the LFMM2 positions. This will be used for a redundancy analysis to identify associations with the distance-based (db) outliers.
$ vcftools --vcf outlierloci6p9g9_red.recode.vcf --recode --recode-INFO-all --exclude-positions outlier_lfmm2_6p9g9_red.loci.txt --out outlierloci6p9g9_uniq_db
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
After filtering, kept 56 out of a possible 70 Sites
Run Time = 0.00 seconds
56 loci in outlierloci6p9g9_uniq_db.recode.vcf.
Now, I’m taking the unique to pcadapt vcf file and splitting it into a set with SNPs inside the inversions and set of SNPs outside the inversions.
# Inversions only
$ vcftools --vcf outlierloci6p9g9_uniq_db.recode.vcf --recode --recode-INFO-all --bed Detected_large_inversions.bed --out outliers6p9g9_uniq_db_invert
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
Read 3 BED file entries.
After filtering, kept 34 out of a possible 56 Sites
Run Time = 0.00 seconds
# Inversions excluded
$ vcftools --vcf outlierloci6p9g9_uniq_db.recode.vcf --recode --recode-INFO-all --exclude-bed Detected_large_inversions.bed --out outliers6p9g9_uniq_db_noinvert
output:
After filtering, kept 60 out of 60 Individuals
Outputting VCF file...
Read 3 BED file entries.
After filtering, kept 22 out of a possible 56 Sites
Run Time = 0.00 seconds